Physical Constraints and Functional Characteristics of 
Transcription Factor-DNA Interaction 



Ulrich Gerland*, J. David Moroz 1 *, and Terence Hwa 

Department of Physics, University of California at San Diego, La Jolla, CA 92093-0319 
(Dated: February 2, 2008) 

We study theoretical "design principles" for transcription factor-DNA interaction in bacteria, 
focusing particularly on the statistical interaction of the transcription factors (TF's) with the 
genomic background (i.e., the genome without the target sites). We introduce and motivate the 
concept of programmability, i.e. the ability to set the threshold concentration for TF binding 
over a wide range merely by mutating the binding sequence of a target site. This functional 
demand, together with physical constraints arising from the thermodynamics and kinetics of TF- 
DNA interaction, leads us to a narrow range of "optimal" interaction parameters. We find that 
this parameter set agrees well with experimental data for the interaction parameters of a few 
exemplary prokaryotic TF's. This indicates that TF-DNA interaction is indeed programmable. 
We suggest further experiments to test whether this is a general feature for a large class of TF's. 



With rapid advances in the sequencing and annotation 
of entire genomes, the task of understanding the asso- 
ciated regulatory networks becomes increasingly promi- 
nent. Currently many experimental and computational 
efforts are devoted to deciphering the genetic wiring diagram
of a cell [1, 2, 3]. Most of these efforts are focused
on locating the functional DNA binding sites of transcrip- 
tion factors (TF's). This knowledge, together with the 
genomic sequences, will provide a qualitative picture of 
which gene products may directly affect the expression 
of which genes. While obtaining such wiring diagrams is 
tremendously important for the eventual understanding 
of gene regulation at the system level, this knowledge in 
itself is not sufficient for the quantitative understanding 
of system-level effects. This has been dramatically shown 
in a detailed experimental study of the regulation of the 
endo16 gene in sea urchin development [4], which revealed
an intricate regulatory function where a dozen or
so TF's control the expression of a single gene. It would 
have been impossible to infer even the gross qualitative 
features of the transcriptional control from the knowledge 
of the binding sites alone. 

A major obstacle to progress is the lack of a quantita- 
tive understanding of the physical interaction between 
the TF's. However, even the simpler interaction be- 
tween TF's and DNA sequences is not so well-understood 
quantitatively: It is common to classify a potential TF- 
binding DNA sequence in a "digital" manner — either 
the sequence is designated for TF-binding or it is not. 
In this view of TF-DNA interaction, differences between 
the TF-binding sequences are only nuisances which im- 
pede straightforward bioinformatic methods of target se- 
quence discovery. On the other hand, there are plenty of 



examples where differences between target sequences are 
known to be functionally important [5]. In many cases,
the binding of a TF to one site occurs only in the presence
of some other TF, while the binding of the same TF
to a different site does not require other TF's. This is often
accomplished by differences in the binding sequences,
and is believed to be the basis for combinatorial control
and signal integration in gene regulation [6]. Also, different
binding sites of the same TF can be "tuned" to bind
at different TF concentrations, as suggested by a recent
study of the E. coli flagella assembly system [7]. If further
experimental studies confirm that tuning of binding
thresholds is indeed used genome wide to establish
desired gene regulatory functions, then TF-DNA binding
should be regarded more in an "analog" than in a
"digital" manner.

In this work, we report our theoretical study on the 
"design" of TF-DNA interaction, assuming the analog 
scheme of operation. Specifically, we impose the func- 
tional requirement that the threshold concentration for 
TF binding to a site can be controlled over a wide range 
by the choice of the sequence alone; we refer to this as 
the "programmability" of TF-DNA binding. Taken to- 
gether with thermodynamic and kinetic constraints, this 
functional requirement leads to a narrow range of "opti- 
mal" TF-DNA interaction parameters. We then compare 
our result to experimentally known parameters for exem- 
plary TF's to determine whether the design of these TF's 
would indeed allow the analog scheme of operation. 

To focus our discussion, we limit ourselves exclusively 
to the case of bacterial TF's which are the best charac- 
terized experimentally. We study both the equilibrium 
occupancy of a target sequence, and the dynamics of lo- 
cating the target. Von Hippel, Berg, and Winter have 
already discussed many aspects of these issues in a series
of seminal articles [8, 9, 10, 11, 12]. Our study is
built firmly upon their work, but includes a number of
additional issues: (i) the effect of sequence-specific binding
to the genomic background (non-target sequences)
on the equilibrium occupation of a target sequence; (ii)
kinetic traps arising statistically from the genomic back- 
ground, and (iii) the desired programmability of TF- 
DNA binding. We adopt the model developed by von 
Hippel and Berg, and allow both the sequence-specific 
and non-specific modes of TF-DNA binding. Sequence- 
specific binding occurs if the binding sequence is suffi- 
ciently close to the best binding sequence, and is governed 
quantitatively by a specificity parameter. For typical 
bacterial TF's whose binding sequences are no more than 
15 bases long, we find that our physical and functional 
requirements are best satisfied within a narrow regime of 
intermediate specificity, amounting to the loss of about
$2\,k_B T$ for each additional base mismatch from the best
binding sequence. Furthermore, the kinetic constraint favors
a low threshold to non-specific binding, while the
programmability requirement pushes the threshold to larger
values. The optimal trade-off value depends only on the
genome size, and lies about $16\,k_B T$ above the energy of
the best binding sequence for a genome of $10^7$ bases.
These values correspond well with the interaction param- 
eters of a number of well-characterized TF's, which sug- 
gests that programmability of TF-DNA binding is com- 
patible with the reality of protein-DNA interaction and 
may be used by the organism to accomplish biological 
functions. We hope to stimulate further experiments de- 
termining the interaction parameters for a wider range of 
TF's (see 'Discussion'). These experiments could either 
strengthen or falsify the programmability concept, depending
on whether the interaction parameters are generally
in agreement with our prediction.



Model of Transcription Factor-DNA Interaction 

Much of our knowledge on the details of TF-DNA in- 
teraction is derived from extensive biochemical experi- 
ments on a few exemplary systems, dating back to pio- 
neering work in the late 70's [ M ® , [lC|, [111 [Ilj, PI and 
continuing through recent years (l^, [IT], |l8j T |l9jr |20fT Fur- 
thermore, detailed structural information is available for 
many TF's from various structural families [2lj| . Based 
on this knowledge, quantitative models of TF-DNA in- 
teraction have been established || |ll|, [l^, [l?]] . Together 
with the recent availability of genomic sequences, these 
models can be used to characterize the thermodynamics 
as well as the dynamics of TF's with genomic DNA in 
a cell. We briefly review the primary model of TF-DNA 
interaction in this section, which serves to introduce our 
notation and formulate the problem. 

Biochemical and structural experiments, e.g. using
lac-repressor, have established firmly that (i)
TF's bind closely to the DNA with a free energy $\Delta G_{ns}$
(with respect to the cytoplasm) regardless of its sequence
due to electrostatic interaction alone, and (ii) additional
sequence-specific binding energy can be gained (via hydrogen
bonds) if the binding sequence is close to the
recognition sequence of the TF. Let the total binding
(free) energy of a TF to a sequence $s = \{s_1, s_2, \ldots, s_L\}$ of
$L$ nucleotides $s_i \in \{A, C, G, T\}$ be $\Delta G[s]$ (with respect to
the cytoplasm), and let $s^*$ be the best binding sequence.
$\Delta G[s]$ becomes sequence-independent, $\Delta G[s] = \Delta G_{ns}$, if
$s$ is far from $s^*$. This is believed to occur via a change
in the conformation of the TF, from one which allows 
more hydrogen bond formation to another which brings 
the positive charges of the TF closer to the negatively
charged DNA backbone [10].



For this study, it will be convenient to measure all energies
with respect to that of the best binder, $\Delta G[s^*]$.
Let us define $E[s] = \Delta G[s] - \Delta G[s^*]$. Furthermore,
we will introduce the threshold energy $E_{ns} = \Delta G_{ns} - \Delta G[s^*]$
where TF-DNA binding switches from the specific
to the non-specific mode (for lac-repressor, $E_{ns} \approx 10$ kcal/mole).
Then given the above model of TF-DNA
interaction, and assuming that the TF is bound to the
DNA essentially all the time (footnote 1), all thermodynamic
quantities regarding this TF can be computed from the
partition function (footnote 2)

$$ Z = \sum_{j=1}^{N} e^{-\beta E[s_j]} + N \cdot e^{-\beta E_{ns}} \qquad (1) $$

where $\beta^{-1} = k_B T \approx 0.6$ kcal/mole and $s_j$ denotes the
subsequence of the genomic sequence $\{s_1, s_2, \ldots, s_N\}$ from
position $j$ to $j + L - 1$. The binding length of a typical
bacterial TF is $L = 10 \sim 20$ basepairs (bp). The length
of genomic sequence $N$ is typically several million bp.

The form of the binding energy $E[s]$ has been studied
experimentally for several TF's [16, 17, 18, 19]. In particular,
recent experiments on the TF Mnt from bacteriophage
P22 [16] support the earlier model [11] that the
contribution of each nucleotide in the binding sequence
to the total binding energy is approximately independent
and additive, i.e.,

$$ E[s] = \sum_{i=1}^{L} \varepsilon_i(s_i). \qquad (2) $$

For the TF's Mnt, Cro, and $\lambda$-repressor, the parameters
of the "energy matrix" $\varepsilon_i(s_i)$ have actually been
determined experimentally by in vitro measurements of
the equilibrium binding constants $K[s] \propto e^{-\beta E[s]}$ for every
single-nucleotide mutant of the best binding sequence
$s^*$. Due to our definition of the energy scale,
$\varepsilon_i(s_i) = 0$ for $s_i = s_i^*$ and $\varepsilon_i(s_i) > 0$ for $s_i \neq s_i^*$; the latter
will be referred to as "mismatch energies". While the
simple form of the binding energy (2) will certainly not
hold for all TF's, and di- and tri-nucleotide correlation effects
are likely to be important in many cases (e.g., to some
extent for lac-repressor |^o|), the key results of our study
are not sensitive to such correlations as long as there is
a wide range of binding energies for different binding
sequences. Thus we will adopt the simple form (2) for this
study. For the three well-studied TF's, the mismatch
energies are typically in the range of 1 to 3 $k_B T$. While the
threshold energies $E_{ns}$ have not been carefully measured
for these TF's, it is believed that non-specific binding
does not occur until the binding sequences are at least 4
to 5 mismatches away from $s^*$ (G. Stormo, private
communication).

1 In vivo measurements for the case of lac-repressor found less than
10% of the TF's were unbound Jlq |. This agrees well with an
estimate based on a typical prokaryotic cell volume of 3 $\mu m^3$, a
genome length of $5 \cdot 10^6$ bases, and a non-specific binding
constant on the order of $10^4\ M^{-1}$ under physiological conditions ]lq],
which yields a fraction of unbound TF's at a few percent level.

2 One should also include the reverse-complement of the genomic
sequence in the evaluation of the partition function $Z$. In order
not to make the notation too complicated, we extend the
definition of "genomic sequence" to include its complement.
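The additive model of Eq. (2) and the partition function of Eq. (1) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the best binder `BEST`, the uniform mismatch energy `EPS`, the threshold `E_NS`, and the toy genome are hypothetical and not measured values, energies are expressed in units of $k_B T$ (so $\beta = 1$), and the reverse complement mentioned in footnote 2 is ignored.

```python
import math

# Hypothetical inputs for illustration only (not measured data).
BEST = "ACGTACGTAC"   # assumed best binding sequence s*
EPS = 2.0             # assumed mismatch energy per base, in units of k_B T
E_NS = 16.0           # assumed non-specific threshold energy, in units of k_B T

def energy_matrix(best, eps):
    """Toy matrix eps_i(s): 0 for the matching base, eps for any mismatch."""
    return [{base: (0.0 if base == ref else eps) for base in "ACGT"} for ref in best]

def specific_energy(site, matrix):
    """Additive binding energy E[s] = sum_i eps_i(s_i), as in Eq. (2)."""
    return sum(column[base] for column, base in zip(matrix, site))

def partition_function(genome, matrix, e_ns):
    """Z of Eq. (1) with beta = 1; N is taken as the number of length-L windows."""
    length = len(matrix)
    sites = [genome[j:j + length] for j in range(len(genome) - length + 1)]
    z_specific = sum(math.exp(-specific_energy(site, matrix)) for site in sites)
    return z_specific + len(sites) * math.exp(-e_ns)

matrix = energy_matrix(BEST, EPS)
print(specific_energy("ACGTACGTAC", matrix))   # perfect match
print(specific_energy("TCGTACGTAA", matrix))   # two mismatches
print(partition_function("GG" + BEST + "TT", matrix, E_NS))
```

A measured energy matrix would simply replace the output of `energy_matrix`; the other two functions do not depend on the two-valued form assumed here.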

Genomic Background and Target Recognition 

Thermodynamics 

Let us first consider the binding of a single TF to its
target sequence, denoted by $s_t$. We will assume that
thermal equilibrium can be reached within the relevant
cellular time scale and discuss the important kinetics issue
afterwards. The effectiveness of the binding of the TF
to its target is then described by the equilibrium binding
probability $P_t$, which depends not only on the binding
energy $E_t = E[s_t]$, but also on the interaction with the
rest of the genomic sequence. Let the contribution of this
genomic background to the partition function be $Z_b$; then
the binding probability to the target is given by

$$ P_t = \frac{1}{1 + e^{\beta (E_t - F_b)}}, \qquad (3) $$

where $F_b = -k_B T \ln Z_b$ is the effective binding energy
(or free energy) of the entire genomic background.
Eq. (3) is a sigmoidal function of $E_t$ with a (soft) threshold
at $F_b$, i.e., a TF binds (with probability $P_t > 0.5$) if
$E_t < F_b$. Since $E_t \geq 0$ by definition, we must have

$$ F_b > 0 \qquad (4) $$

in order for a target sequence to be recognized by a single
TF (we consider multiple TF's below).
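The soft threshold in Eq. (3) can be made concrete with a short Python sketch. The function names and the background value $Z_b = 0.1$ are hypothetical choices for illustration; energies are in units of $k_B T$.

```python
import math

def background_free_energy(z_b):
    """F_b = -ln Z_b, in units of k_B T."""
    return -math.log(z_b)

def target_occupancy(e_t, f_b):
    """Equilibrium occupancy P_t of Eq. (3); energies in units of k_B T."""
    return 1.0 / (1.0 + math.exp(e_t - f_b))

f_b = background_free_energy(0.1)     # hypothetical Z_b < 1, so F_b > 0
for e_t in (0.0, f_b, 2.0 * f_b):
    print(e_t, target_occupancy(e_t, f_b))
```

The occupancy equals one half exactly at $E_t = F_b$. For a background with $Z_b > 1$ the free energy $F_b$ is negative, and even the best binder ($E_t = 0$) is occupied less than half of the time, which is the content of condition (4).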

The background contribution can be computed for any
given TF and genome according to Eq. (1) if the binding
energy matrix, the threshold energy $E_{ns}$, and the
genomic sequence are known. We will instead seek a
description that is independent of the specifics of the
genomic sequences and energy matrices. To accomplish
this, we observe first that for the few well-studied TF's,
the interaction of the TF with the genomic background
can be well approximated by the interaction of the TF
with random nucleotide sequences of the same length and
single-nucleotide frequencies $p(s)$. This is illustrated in
Fig. 1(a) where the histogram of binding energies obtained
by using the binding energy matrix $\varepsilon_i(s)$ for the




FIG. 1 For the purpose of TF binding, the genome may be
treated as random DNA plus functional target site(s): (a)
Histogram of the specific binding energies for Cro [solid line]
on the E. coli genome, together with the average histogram
[circles] for Cro on random nucleotide sequences (synthesized
with the same length and single-nucleotide frequencies as the
E. coli genome; normalization for both histograms such that
maximum is at $N$). Except for statistical fluctuations at the
low energy end, the histograms are indistinguishable from
each other. The approximate position of the threshold energy
for nonspecific binding $E_{ns}$ is indicated as the thin straight
line. (b) Energy landscape for Cro on the bacteriophage $\lambda$
DNA. The landscape appears to be random, e.g. no "funnel"
guides the TF to the target site. The spatial correlation
function of the landscape (not shown) decays quickly to zero
beyond the scale of $L = 17$ for this case. Random energy
landscapes are also found for the other two TF's with known
energy matrices (not shown).

TF Cro on the E. coli genome (solid line) coincides well
with the histogram of the same energy matrix applied to
random nucleotide sequences (circles). Moreover, there
appears to be hardly any positional correlation in the
binding energies along the genome, as shown by the "energy
landscape" in Fig. 1(b) (see caption for details).
In the following, we will therefore describe the effect of
the genomic background by treating it as a random
nucleotide sequence for a generic TF. In particular, we will
describe the genomic background partition function by
$Z_b = Z_{sp} + N \cdot e^{-\beta E_{ns}}$ where the contribution due to
sequence-specific binding is

$$ Z_{sp} = \sum_{s \in S(N)} e^{-\beta E[s]}, \qquad (5) $$

with $S(N)$ denoting a given collection of $N$ random
nucleotide sequences of length $L$, drawn according to the
frequency $p(s)$ for each nucleotide $s$.

Even with the random sequence approximation (5),
computation of the background energy $F_b = -k_B T \ln Z_b$
is nontrivial in principle: From its definition, it is clear
that $F_b$ is a random variable, and its precise value
will depend on the actual collection of sequences $S(N)$.
We are interested in the typical value of $F_b$, a reasonable
approximation of which is its statistical average,
$\overline{F_b} = -k_B T\, \overline{\ln Z_b}$. [We use an overbar to denote averages
over an ensemble of different sequence collections $S(N)$.]
Computing the average $\overline{\ln Z_b}$ is however difficult to do
for an arbitrary energy matrix $\varepsilon_i(s)$ short of performing
numerical simulations. An alternative is to compute the
ensemble average of $Z_b$, i.e., $\overline{Z_b} = \overline{Z_{sp}} + N e^{-\beta E_{ns}}$, where

$$ \overline{Z_{sp}} = N \prod_{i=1}^{L} \Big[ \sum_{s \in \{A,C,G,T\}} e^{-\beta \varepsilon_i(s)}\, p(s) \Big] \qquad (6) $$

with the single-nucleotide frequencies $p(s)$, and assume
that

$$ \overline{F_b} \approx -k_B T \ln \overline{Z_b}. \qquad (7) $$



This is, for example, the approach taken by Stormo and
Fields [17] in their analysis of the TF Mnt (footnote 3). We note in
passing that $\overline{Z_b}$ can be more compactly written in terms
of the density of states $\Omega_{sp}(E)$ for specific binding [the
normalized version of the histogram in Fig. 1(a)], i.e.,

$$ \overline{Z_b} = \sum_E \Omega_{sp}(E)\, e^{-\beta E} + N \cdot e^{-\beta E_{ns}}. \qquad (8) $$
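As a rough numerical check of the ensemble average in Eq. (6), the following Python sketch compares it with $Z_{sp}$ of Eq. (5) evaluated on one randomly drawn collection of sequences. The matrix (best base "A" at every position, a single mismatch energy `EPS`), the length, the number of sites, and the random seed are hypothetical toy values chosen so that the script runs quickly; energies are in units of $k_B T$.

```python
import math
import random

def annealed_z_sp(n_sites, matrix, freqs):
    """Ensemble average of Eq. (6): N * prod_i sum_s p(s) * exp(-eps_i(s))."""
    z = float(n_sites)
    for column in matrix:
        z *= sum(freqs[b] * math.exp(-column[b]) for b in "ACGT")
    return z

def sampled_z_sp(n_sites, matrix, freqs, rng):
    """Z_sp of Eq. (5) for one random collection S(N) of n_sites sequences."""
    bases = "ACGT"
    weights = [freqs[b] for b in bases]
    total = 0.0
    for _ in range(n_sites):
        site = rng.choices(bases, weights=weights, k=len(matrix))
        total += math.exp(-sum(col[b] for col, b in zip(matrix, site)))
    return total

# Toy parameters (assumptions of this sketch, not values from the text).
EPS, LENGTH, N_SITES = 1.0, 6, 20000
matrix = [{b: (0.0 if b == "A" else EPS) for b in "ACGT"} for _ in range(LENGTH)]
freqs = {b: 0.25 for b in "ACGT"}

annealed = annealed_z_sp(N_SITES, matrix, freqs)
sampled = sampled_z_sp(N_SITES, matrix, freqs, random.Random(0))
print(annealed, sampled)
```

With these toy values many sequences contribute to the sum, so a single realization stays close to the average. The agreement is expected to degrade when only a few low-energy sequences dominate $Z_{sp}$.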



The relation (7) is based on the so-called "annealed
approximation" $\overline{\ln Z} \approx \ln \overline{Z}$, which is valid for the genomic
sequence length $N \to \infty$ but not always appropriate for
finite $N$, e.g., if the partition function is dominated by
a few low energy terms. Much is known from statistical
physics about systems of the type defined by the partition
function $Z_{sp}$ in (5), generically known as the Random
Energy Model or REM (footnote 4), introduced by Derrida [22].
It turns out that the annealed approximation is valid as
long as the system's entropy $S$ is significantly larger than
zero, reflecting the contribution of many terms in the
partition sum. We will see further below that proper function
of the TF's requires the system to be in a regime
where the annealed approximation is safely applicable.
We will thus take the relation (7) for granted. In this
case, the condition (4) for the recognition of the target
sequence by a single TF becomes

$$ \overline{Z_b} < 1. \qquad (9) $$



Search Dynamics 



In order to carry out their function properly, TF's not 
only need to have a high equilibrium binding probability 
to their targets, but also must be able to locate them in 
a reasonably short time (e.g. less than a few minutes) 



In Re f. jrjj , the nonspecific binding was not included so that 



Zb = Z sp , and the energy scale was shifted such that 



-N. 

4 In many applications, including protein folding pa], the REM
was introduced to approximate the random background interaction.
The TF-DNA interaction as defined by (5) represents
one of the few systems for which the REM description is directly
applicable.



after they have been activated by an inducer or freshly 
produced by a ribosome. This constitutes a constraint 
on the "search dynamics" of TF's. 

In their non-specific binding mode, TF's are still
strongly associated with the DNA, but are able to diffuse
(i.e., slide) randomly along the genome [8, 9, 10].
However, pure one-dimensional diffusion would be an
inefficient search process, since it is very redundant (e.g.,
a 1D random walker always returns to the start).
For instance, assuming generously a 1D diffusion constant
of $D_1 \approx 1\ \mu m^2$/sec [10], one finds a time
$T_{1D} \sim N^2/D_1 \sim 10^6$ sec for a single TF to diffuse around a
bacterial genome of length $N \approx 5 \times 10^6$ bp (about 1 mm).
Thus, in order to find a target within a few minutes via
1D diffusion, one would need at least 100 TF's per cell to
search in parallel (so that the search length $N$ is reduced
by a factor of 100). On the other hand, there are
well-documented examples where regulation is accomplished
effectively by only a few TF's in a cell (e.g., about 10 for
lac-repressor in E. coli).
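The order-of-magnitude estimate for pure 1D sliding can be reproduced with a few lines of Python. The conversion factor of 0.34 nm per base pair (B-form DNA) is an assumption added here to turn the genome length into micrometres; the function name is hypothetical.

```python
BP_RISE_UM = 0.34e-3   # assumed rise per base pair of B-form DNA, in micrometres

def sliding_time_sec(n_bp, d1_um2_per_sec=1.0, n_tf=1):
    """Order-of-magnitude 1D search time (N / n_tf)^2 / D_1, with N as a length."""
    segment_um = n_bp * BP_RISE_UM / n_tf
    return segment_um ** 2 / d1_um2_per_sec

print(sliding_time_sec(5e6))             # single TF on the whole genome
print(sliding_time_sec(5e6, n_tf=100))   # 100 TFs searching in parallel
```

Because the time scales with the square of the segment length, splitting the genome among 100 TFs shortens the search by a factor of $10^4$, which brings it from roughly $10^6$ sec down to the range of minutes.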

As studied in detail by Winter, Berg, and von Hippel
[8, 9, 10], the search dynamics of TF's involves instead
a combination of sliding along the DNA at short length
scales and hopping between different segments of DNA
(either over the dissociation barrier through the cytoplasm,
or by direct intersegment transfer); see Fig. 2(a).
This search mode is much faster (given the high DNA
concentration inside the cell), since the dynamics is
essentially 3D diffusion beyond the hopping scale, and 3D
diffusion is much less redundant than 1D diffusion. For
example, if the TF's were not bound to the DNA at all,
a single TF of a few nm in linear dimension $\ell$ would locate
its target in a cell volume $V_{cell}$ of several $\mu m^3$ in the
average first passage time of $T_{3D} = V_{cell}/(4\pi \ell D_3) \sim 10$
sec, given a 3D diffusion constant on the order of
$D_3 \sim 10\ \mu m^2$/sec pSj]. The search time $T_{3D/1D}$ for the
combined 1D/3D diffusion under in vivo conditions can be
estimated to be comparable to $T_{3D}$ [jH^. Hence, the
search time is short enough to comfortably allow even
a single TF to locate its target within the physiological
time scale.
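The 3D first passage time quoted above can be checked numerically. In this Python sketch the cell volume of 3 $\mu m^3$ and the TF size of 3 nm are assumed representative values for "several $\mu m^3$" and "a few nm"; the function name is hypothetical.

```python
import math

def first_passage_time_sec(v_cell_um3, target_size_um, d3_um2_per_sec):
    """T_3D = V_cell / (4 * pi * l * D_3) for a diffusion-limited target search."""
    return v_cell_um3 / (4.0 * math.pi * target_size_um * d3_um2_per_sec)

# Assumed values: V_cell = 3 um^3, l = 3 nm, D_3 = 10 um^2/sec.
print(first_passage_time_sec(3.0, 3e-3, 10.0))
```

With these inputs the result is on the order of 10 sec, several orders of magnitude below the pure 1D sliding estimate.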

In the study of the search dynamics reviewed above,
binding of the TF to the genomic background was assumed
to occur at a single energy value, namely, the
non-specific energy $\Delta G_{ns}$ H. On the other hand, the
"energy landscape" of Fig. 1(b) clearly shows that the
random genomic background contains many isolated sites
with binding energies far below $\Delta G_{ns}$. These sites constitute
kinetic traps which can, in principle, drastically
impede the local search process, if the energy difference
to their surroundings is sufficiently large (footnote 5). Thus to fully
understand the search dynamics, we need to characterize
the effect of kinetic traps in the genomic background:
what is the constraint on the design of TF-DNA interaction
imposed by requiring that the effect of kinetic traps
be negligible?

5 Note that the additional sequence-specific binding energy to a
'spurious site' in the background equally increases the kinetic
barrier for sliding to a neighboring site as well as for dissociation
into the cytoplasm.

FIG. 2 (a) Schematic illustration of the search dynamics: a
TF (represented by a solid ellipse) moves among genomic
DNA (lines) via a combination of 1D diffusion (along the
genome) and 3D diffusion (hopping between nearby segments)
as illustrated by the arrows. The open circles indicate the
potential kinetic traps which are sites that are preferred by the
TF in a random background. (b) Dependence of the chemical
potential $\mu$ on the number $n$ of TF's in a cell for Mnt, Cro,
and $\lambda$-repressor, obtained by directly solving and inverting
the defining equation (13). The comparison with the dashed
line $\mu = k_B T \ln n$ shows that $\mu(n)$ is sufficiently well described
by the simple expression (14) over the regime $1 < n < 1000$.

At each binding sequence $s_j$ with energy $E_j = E[s_j] < E_{ns}$,
the TF typically spends a time $\tau_j = \tau_0 \cdot e^{\beta (E_{ns} - E_j)}$,
where $\tau_0$ is the average 'waiting time' of the TF at a
non-specific binding site. Along the search path of the TF,
the average waiting time $\tau$ per binding site is then simply
given by

$$ \tau = \frac{\tau_0}{N} \sum_E \Omega_{sp}(E) \left[ e^{\beta (E_{ns} - E)}\, \theta(E_{ns} - E) + \theta(E - E_{ns}) \right]. \qquad (10) $$

Here we assumed as before that the genomic sequence
is random so that the sequence-specific binding energy
$E$ can be treated as a random variable drawn from the
distribution $\Omega_{sp}(E)$. The second term, with the help of
the unit step function $\theta(x)$, is used to express the fact
that there is no kinetic trap for the (majority of) sites
with $E > E_{ns}$.

A comparison of (10) and Eq. (8) for the average partition
function $\overline{Z_b}$ immediately yields the important
relation (footnote 6)

$$ \tau/\tau_0 \approx \overline{Z_b}\, e^{\beta E_{ns}}/N, \qquad (11) $$

since in Eq. (8), the second term dominates for $E > E_{ns}$.
As expected, the kinetic trap factor $\tau/\tau_0$ grows exponentially
with $E_{ns}$, the threshold to nonspecific binding.
On the other hand, we note from $\overline{Z_b} = \overline{Z_{sp}} + N e^{-\beta E_{ns}}$
(see 'Thermodynamics') that the trap factor can be
made to be of order one so that the dynamical analysis
of Refs. [8, 9, 10] remains qualitatively valid, if
$\overline{Z_{sp}} < N e^{-\beta E_{ns}}$. The physical meaning of this condition
is that the average effect of the kinetic traps can be
rendered small if the sum of the waiting times does not
exceed the order of the plain diffusion time. As we will
see, this can be accomplished by choosing the binding energy
matrix $\varepsilon_i(s)$ and $E_{ns}$ appropriately. Combining this
kinetic constraint with Eq. (9), we obtain the condition

$$ \overline{Z_{sp}} < N e^{-\beta E_{ns}} < 1 \qquad (12) $$

for the rapid recognition of a target sequence by a single
TF.
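A minimal Python sketch can compare the trap factor from the explicit sum over sites, Eq. (10), with the compact estimate of Eq. (11). The density of states used here is an assumption made purely for illustration: every mismatch costs the same energy `eps`, all four bases are equally frequent, and $\Omega_{sp}(E)$ is normalized so that it sums to the number of sites. The parameter values $N = 10^7$, $L = 15$, $\varepsilon = 2$, $E_{ns} = 16$ (in units of $k_B T$) are likewise illustrative.

```python
import math

def two_state_density(n_sites, length, eps):
    """Omega_sp(E) for a toy matrix with one mismatch energy and equal base frequencies.

    Returns (energy, number of sites) pairs; the counts sum to n_sites.
    """
    return [(eps * r, n_sites * math.comb(length, r) * 3**r / 4**length)
            for r in range(length + 1)]

def trap_factor_exact(density, n_sites, e_ns):
    """tau / tau_0 from the explicit sum of Eq. (10), energies in units of k_B T."""
    total = 0.0
    for energy, count in density:
        total += count * (math.exp(e_ns - energy) if energy < e_ns else 1.0)
    return total / n_sites

def trap_factor_estimate(density, n_sites, e_ns):
    """tau / tau_0 from Eq. (11): Z_b * exp(E_ns) / N."""
    z_b = sum(c * math.exp(-e) for e, c in density) + n_sites * math.exp(-e_ns)
    return z_b * math.exp(e_ns) / n_sites

density = two_state_density(1e7, 15, 2.0)
print(trap_factor_exact(density, 1e7, 16.0))
print(trap_factor_estimate(density, 1e7, 16.0))
```

For these parameters both expressions give a trap factor of order one, and the estimate lies slightly above the explicit sum because it also counts the small Boltzmann weights of sites with $E > E_{ns}$.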



Programmability of Binding Threshold 

Multiple transcription factors 

There are of course typically multiple copies of the
same TF in the cell, and the regulatory function is
accomplished if any one of these TF's binds to the target
sequence. If the cell contains $n$ copies of a given TF,
then the occupation probability for the target sequence,
Eq. (3), is replaced by the Fermi distribution (or 'Arrhenius
function') $P_t = 1/(1 + e^{\beta (E_t - \mu)})$, since each binding
sequence can at most be occupied by one TF. The chemical
potential $\mu(n)$ is determined implicitly from the
condition (footnote 7)

$$ n = \sum_E \left[ \Omega_{sp}(E) + \delta_{E,E_t} \right] \cdot \frac{1}{1 + e^{\beta (E - \mu)}}, \qquad (13) $$

where the quantity in brackets represents the total density
of states. In the simplest scenario, where steric exclusion
between TF's bound to the non-target sequences
is negligible, one has [11]

$$ \mu \approx k_B T \ln n + F_b. \qquad (14) $$

This is empirically found to be a good approximation
for those TF's with known binding energy matrices, as
shown in Fig. 2(b). We will adopt the form (14) for the
chemical potential of a generic TF in this study; a general
argument will be given later to justify this choice, even
for the case where multiple target sequences are present
in the same genome.
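The quality of the approximation (14) can be explored numerically by solving a condition of the form (13) for $\mu$ and comparing with $k_B T \ln n + F_b$. The Python sketch below is an illustration under stated assumptions, not the calculation behind Fig. 2(b): it uses a hypothetical density of states with a single mismatch energy and equal base frequencies, adds the non-specific mode as $N$ extra states at $E_{ns}$, omits the single target state, and solves for $\mu$ by bisection. Energies are in units of $k_B T$.

```python
import math

def density_of_states(n_sites, length, eps, e_ns):
    """(energy, count) pairs: toy specific states plus N non-specific states."""
    states = [(eps * r, n_sites * math.comb(length, r) * 3**r / 4**length)
              for r in range(length + 1)]
    states.append((e_ns, float(n_sites)))   # non-specific mode (assumption of this sketch)
    return states

def occupancy_sum(states, mu):
    """Right-hand side of a condition of the form (13), without the target term."""
    return sum(count / (1.0 + math.exp(energy - mu)) for energy, count in states)

def chemical_potential(states, n_tf, lo=-50.0, hi=50.0):
    """Solve occupancy_sum(states, mu) = n_tf by bisection (the sum increases with mu)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if occupancy_sum(states, mid) < n_tf:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dilute_estimate(states, n_tf):
    """mu from Eq. (14): ln n + F_b, with F_b = -ln Z_b."""
    f_b = -math.log(sum(count * math.exp(-energy) for energy, count in states))
    return math.log(n_tf) + f_b

states = density_of_states(1e7, 15, 2.0, 16.0)
for n_tf in (1, 10, 100, 1000):
    print(n_tf, chemical_potential(states, n_tf), dilute_estimate(states, n_tf))
```

The bisection result always lies at or above the estimate, because Fermi occupation of the few strongest sites saturates while the Boltzmann form keeps growing. In this toy setting the gap widens as $n$ increases and the low-energy sites fill up.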

Using (14), the occupation probability can be more
succinctly written as $P_t = 1/(1 + n_t/n)$, where

$$ n_t = e^{\beta (E_t - F_b)} \qquad (15) $$

denotes the (soft) threshold concentration of the TF for
occupation of the target sequence.

6 Note that this relation is actually independent of the additive
form of the binding energy (2).

7 Here, the exclusion between overlapping binding sites can be
neglected since $n \ll N$. Also, we have not included the (unimportant)
exclusion between the specific and the unspecific binding
mode at a given site.


Programmability 

The allowed values of the background free energy $F_b$
for the binding of the target sequence obviously depend
on the TF concentration $n$. For example, we have the
condition (4) for $n = 1$, while smaller values are allowed
for $n > 1$. It thus appears that the allowed $F_b$'s are
different for the different TF's, since they would typically
be present in the cell with different concentrations.
On the other hand, even for a given TF species, the de- 
sired binding threshold may not be at a single concen- 
tration for different target sites, but can vary depending 
on functional demands. For example, it can be desir- 
able to turn on different genes/operons at different TF 
concentrations, in order to maintain a temporal order in 
the expression of different operons as the concentration 
of the controlling TF gradually changes over time. This 
effect was observed recently for the E. coli flagella as- 
sembly and SOS response systems (U. Alon, private 
communication) . 

As another example, consider the case where a particular
TF A is involved in the regulation of two operons,
X and Y. Suppose it is desired that A activates the transcription
of operon X on its own at a concentration $n_A$,
while operon Y should only be activated if A is present
(at the same concentration $n_A$) together with another TF
B that can bind cooperatively with A. It is then desirable
to have a strong binding site for A in the regulatory region
of operon X such that its threshold $n_{A,X} < n_A$ and
a weak binding site in the regulatory region of operon Y,
with a threshold $n_{A,Y} > n_A$. The latter ensures that the
operon Y will not be accidentally activated by fluctuations
in $n_A$ alone, and only when the TF B is present
would the attractive interaction between A and B induce
the two to bind to their targets.

The above examples show that it is functionally desirable
to have the ability to set the binding threshold
$n_t$ of a given TF to each of its target sequences $s_t$
individually. As is clear from the defining expression (15),
this can only be done through the choice of the target
sequence $s_t$ which affects $E_t$, since the other variable, $F_b$,
is fixed for a given TF. We refer to the ability to control
the binding threshold $n_t$ through the choice of the target
sequence $s_t$ alone as "programmability" of the binding
threshold. Assuming that programmability is a desirable
feature of TF-DNA interaction (since sequence changes
can be easily accomplished by point mutation if the functional
need arises), we seek to determine the specifics of
the TF-DNA interaction, e.g., the binding matrix $\varepsilon_i(s)$,
the length of the binding sequence $L$, and the threshold
energy $E_{ns}$, which allow the targets to be maximally
programmable.



Two-state model and parameter selection 

Specifically, let us require programmability of the binding
threshold over the entire range $n_t = 1 \ldots 10^3$, since
typical cellular TF concentrations range from a few to
a few hundred per cell. The lower bound $n_t \sim 1$ immediately
imposes the condition (4) on $F_b$, or, taking also
the kinetic constraint into account, the condition (12).
Furthermore, in order to tune $n_t$ throughout the desired
range with a reasonable resolution, it is necessary to have
the ability to change $E_t$ from 0 to $k_B T \ln 10^3 \approx 7\,k_B T$, in
small increments. This requires the non-zero entries of
the binding energy matrix $\varepsilon_i(s)$ to take on small values.
Which choices for the TF-DNA interaction parameters
($\varepsilon_i(s)$, $L$, $E_{ns}$) can simultaneously satisfy the latter
requirement and condition (12)?

The combined effect of these physical constraints and
functional demands is best understood by simplifying the
energy matrix $\varepsilon_i(s)$ such that we retain the essential and
generic aspect of sequence-specific binding, while eliminating
all TF-specific details. Towards this end, we adopt
the two-state model originally introduced by von Hippel
and Berg [11], characterizing all of the non-zero entries of
the significant positions (footnote 8) in the energy matrix by a single
value, i.e.,

$$ \varepsilon_i(s) = \begin{cases} 0 & \text{if } s = s_i^* \\ \varepsilon > 0 & \text{if } s \neq s_i^* \end{cases} \qquad (16) $$



where $\varepsilon$ is a dimensionless "discrimination energy" (in
units of $k_B T$). It describes the energetic preference of
the TF for the optimal binding sequence $s^*$, and is a
crucial parameter controlling the specificity of the TF.
Within the two-state model, the binding energy to the
target $s_t$ is simply $\varepsilon$ times the total number of mismatches
between the target and the best binder $s^*$, i.e.,
$E[s_t] = \varepsilon \cdot |s_t - s^*|$, where $|\ldots|$ denotes the Hamming distance
between two sequences. Clearly, programmability is best
satisfied with a small $\varepsilon$ which enhances the resolution of
the programmable binding threshold.

The two-state model (16) also allows an explicit evaluation
of the condition (12) via the formula (6) for $\overline{Z_{sp}}$.
Assuming for simplicity equal single-nucleotide frequencies
in the background (i.e., $p(s) = 1/4$), the quantity in the
bracket of (6) is easily evaluated. We have $\overline{Z_{sp}}(\varepsilon, L) = N \cdot \zeta^L(\varepsilon)$
where $\zeta = \sum_s e^{-\beta \varepsilon(s)} p(s) = (1 + 3 e^{-\varepsilon})/4$. Note
that $\zeta^{-1}$ is in the range between 1 and 4, and can be
regarded as the effective size of the nucleotide "alphabet"
as "seen" by the TF in the specific binding mode. The
maximum value $\zeta^{-1} = 4$ is attained if the energy matrix
has infinite discrimination, $\varepsilon \to \infty$, while no discrimination
can be achieved at $\varepsilon = 0$ where $\zeta^{-1} = 1$. In



8 Note that the energy matrices for most TF's contain a number of 
(fixed) positions which have no strong preference for any of the 
nucleotides. We will not consider these positions in the ensuing 
discussion of the two-state model, and will use L to refer to the 
total number of significant positions. 

FIG. 3 (a) Plot of the region where $Z_{sp}(\varepsilon, L) < 1$. The boundary
$L^*(\varepsilon)$ for $N = 10^7$ is indicated by the solid line; see text.
The dashed line $\ln(N)/(\ln \zeta^{-1}(\varepsilon) - \varepsilon/(1 + e^{\varepsilon}/3))$ indicates
the onset of the glass transition in the random energy model
where the annealed approximation breaks down. As argued in
the text, the desired parameter regime is close to $Z_{sp} = 1$, so
that the annealed approximation is justified. (b) The binding
threshold $n$ as a function of the total number of mismatches $r$
of the target sequence $s_t$ from the best binder $s^*$ at different
parameter combinations $(\varepsilon, L)$.



Fig. 3(a), we indicate the allowed region $Z_{sp}(\varepsilon, L) < 1$
in the parameter space of $(\varepsilon, L)$, with the boundary
$L^*(\varepsilon) = \ln N / \ln \zeta^{-1}(\varepsilon)$ defined by $Z_{sp}(L^*, \varepsilon) = 1$. From
the figure, it is clear that the desire for small $\varepsilon$ pushes the
system to the boundary at $Z_{sp} = 1$. Along the boundary,
the smallest $\varepsilon$ is given by the largest allowable binding
length $L$. For typical bacterial TF's whose binding sequences
are no longer than about 15 bp (usually dimers),
we find $\varepsilon \approx 2$.
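
The boundary argument can be checked numerically. The sketch below, under the stated assumption of equal background nucleotide frequencies, evaluates the per-position factor, the annealed partition sum $N\zeta^L$, the boundary length, and the discrimination energy that places a site of length 15 on the boundary for $N = 10^7$. All function names are hypothetical; energies are in units of $k_BT$.

```python
import math


def zeta(eps: float) -> float:
    """Per-position factor zeta = (1 + 3 exp(-eps)) / 4 for p(s) = 1/4."""
    return (1.0 + 3.0 * math.exp(-eps)) / 4.0


def z_sp(eps: float, length: int, n_sites: float) -> float:
    """Annealed specific partition sum Z_sp = N * zeta**L."""
    return n_sites * zeta(eps) ** length


def boundary_length(eps: float, n_sites: float) -> float:
    """Boundary L*(eps) = ln N / ln(1/zeta), where Z_sp = 1."""
    return math.log(n_sites) / math.log(1.0 / zeta(eps))


def boundary_eps(length: int, n_sites: float) -> float:
    """Invert Z_sp(eps, L) = 1 for eps at fixed L (requires N < 4**L)."""
    zeta_star = n_sites ** (-1.0 / length)   # value of zeta on the boundary
    return -math.log((4.0 * zeta_star - 1.0) / 3.0)


N = 1e7   # genome size used in the text
L = 15    # longest typical binding length quoted in the text
eps_star = boundary_eps(L, N)
print(round(eps_star, 2), round(boundary_length(2.0, N), 1))
```

With these inputs the boundary value of the discrimination energy comes out slightly above 2, consistent with the estimate quoted above.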

While the result on $\varepsilon$ is somewhat specific to the two-state
model, the need for $Z_{sp} \to 1$ imposed by the programmability
consideration forces the threshold energy
to take on the value



$$E_{ns} = k_BT \ln N \approx 16\, k_BT \tag{17}$$



(for $N \sim 10^7$) according to the condition (O), independent
of the specifics of the binding energy matrix $\mathcal{E}$. It
also follows that



$$F_b \approx 0 \tag{18}$$

so that the binding threshold is simply given by

$$n[s_t] \approx e^{\beta E[s_t]} \tag{19}$$



The dependence of $n$ on the number of mismatches
for the two-state model is shown in Fig. 3(b). We see
that at the optimal parameter choice of ($\varepsilon = 2$, $L = 15$),
each mismatch increases the binding threshold $n$ by
nearly 10-fold. In principle, further fine-tuning can be accomplished
by utilizing small variations in the mismatch
energies.
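
The mismatch dependence of the threshold can be illustrated with a few lines of Python. This is a sketch assuming the two-state model with a background free energy close to zero, so that the threshold is the exponential of the discrimination energy times the number of mismatches; the function name and the range of mismatch counts are illustrative choices.

```python
import math


def binding_threshold(mismatches: int, eps: float) -> float:
    """Threshold n[s_t] ~ exp(beta * E[s_t]) with beta * E = eps * r (F_b ~ 0)."""
    return math.exp(eps * mismatches)


eps = 2.0
thresholds = [binding_threshold(r, eps) for r in range(6)]
fold_per_mismatch = thresholds[1] / thresholds[0]
for r, n in enumerate(thresholds):
    print(r, f"{n:.3g}")
```

Each additional mismatch multiplies the threshold by the same factor, the exponential of the discrimination energy, which for a value of 2 is somewhat below 10.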



Discussion 

The key results of this study, that maximal programmability
of the binding threshold $n$ requires the TF-DNA
interaction to satisfy the conditions (17) and (18),




FIG. 4 Graphical construction of the background free energy
$F_b$ and other quantities used in the text. Horizontal axis: $E$ [$k_BT$].



can be conveniently summarized graphically using the
density of states $\Omega_{sp}(E)$. In Fig. 4, the density of states is
plotted with the normalization that $\max_E \Omega_{sp}(E) = N$,
as indicated by the horizontal dotted line. The background
free energy $F_b$ can be obtained using the Legendre
construction: One draws the line $e^{\beta(E - F_b)}$ (the dashed
line in the semi-log plot of Fig. 4) such that it just touches
$\Omega_{sp}(E)$. $F_b$ can then be read off as the intercept of the
dashed line on the $E$-axis, which should be in the vicinity
of the origin according to (18). Similarly, $E_{ns}$ [as given
by (17)] can be read off as the $E$-coordinate where the
dashed line intersects the horizontal dotted line.

The point where the dashed line is tangent to $\Omega_{sp}(E)$ is
also physically meaningful: The $E$-coordinate of the
tangent point gives the ensemble-averaged binding energy
$E_0 \equiv \sum_E E\,\Omega_{sp}(E)\,e^{-\beta E}/Z_{sp}$. The vertical coordinate
$N_0$ of the tangent point is given by the relation
$F_b = E_0 - k_BT \ln N_0$, which expresses the fact that
the dominant contribution to the background free energy
stems from the $N_0$ sequences of energy $\approx E_0$ in the collection
of $N$ random sequences: The Boltzmann weight
of those sequences with $E > E_0$ is too small to contribute
to the partition sum, while for $E < E_0$, there are too few
sequences.
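
The quantities entering this construction can be computed directly for the two-state model. The sketch below assumes a random background with equal nucleotide frequencies, counts only specific binding, and uses a discrete histogram of the density of states whose counts sum to $N$ (a different normalization from the one used for plotting in Fig. 4). Variable names are illustrative, and the resulting numbers should be read as rough estimates under these assumptions.

```python
import math

N = 1e7      # number of background sequences (value used in the text)
L = 15       # number of significant positions
EPS = 2.0    # discrimination energy in units of k_B T

# Density of states of the two-state model for a random background with
# equal nucleotide frequencies: r mismatches occur with binomial weight.
energies = [EPS * r for r in range(L + 1)]
omega = [N * math.comb(L, r) * 3**r / 4**L for r in range(L + 1)]

# Partition sum, background free energy and mean energy (beta = 1).
weights = [w * math.exp(-e) for w, e in zip(omega, energies)]
z_sp = sum(weights)
f_b = -math.log(z_sp)
e_0 = sum(e * w for e, w in zip(energies, weights)) / z_sp

# Height of the tangent point, from the relation F_b = E_0 - ln N_0.
n_0 = math.exp(e_0 - f_b)

print(f"Z_sp = {z_sp:.3f}, F_b = {f_b:.3f}, E_0 = {e_0:.3f}, N_0 = {n_0:.3g}")
```

The partition sum obtained from the histogram agrees with the closed form $N\zeta^L$, which serves as a consistency check on the density of states.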

The value of $N_0$ is an important characteristic of the
system. $S = \ln N_0$ is known as the "entropy" of this
system, and $H = \ln(N/N_0)$ is known as the "relative
entropy"; the latter has been used to characterize the
specificity of the TF-DNA interaction [17]. As mentioned
before, the annealed approximation is only valid if many
terms contribute to the partition sum, i.e., if $N_0 \gg 1$ or
$S > 0$. For the two-state model (16), the values of $\varepsilon$ and
$L$ corresponding to the line $N_0 = 1$ are far from the line
$L^*(\varepsilon)$ selected by the maximal programmability criterion;
this justifies the use of the annealed approximation. At
the optimal parameters $\varepsilon = 2$ and $L = 15$, we have
$N_0 \approx 10^3 \gg 1$. The corresponding relative entropy is
$H \approx 7$ (about 10 bits).
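
The relative entropy quoted above can be reproduced with a short calculation. The sketch assumes the two-state model with equal background frequencies and only specific binding; under these assumptions $\ln(N/N_0)$ reduces to $L$ times the per-position Kullback-Leibler divergence between the Boltzmann weights and the uniform background. The function name is hypothetical.

```python
import math


def relative_entropy_nats(eps: float, length: int) -> float:
    """H = L * D_KL(Boltzmann || uniform) for the two-state model."""
    z = 1.0 + 3.0 * math.exp(-eps)   # per-position partition sum
    p_match = 1.0 / z                # weight of the optimal base
    p_miss = math.exp(-eps) / z      # weight of each of the other three bases
    kl = (p_match * math.log(4.0 * p_match)
          + 3.0 * p_miss * math.log(4.0 * p_miss))
    return length * kl


h_nats = relative_entropy_nats(2.0, 15)
h_bits = h_nats / math.log(2.0)
print(f"H = {h_nats:.2f} nats = {h_bits:.1f} bits")
```

Dividing by $\ln 2$ converts the natural-log value to bits, the unit used for the relative entropy in Table I.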

The large value of $N_0$ also provides us with an intuitive
understanding of the simple dependence (Q) of the
chemical potential $\mu$ on the cellular TF concentration
$n$; see Fig. ||(b). As mentioned already, the expression
( p^ ) is obtained if multiple occupancy of the background

                 theory    Mnt      Cro      cI       LacR
F_b [k_B T]      ≈ 0       -1.2     -1.6     -0.8     -
H [bits]         ≈ 10      8.9      13.5     12.7     -
E_ns [k_B T]     16        17†      -        -        ≈ 16

TABLE I Comparison of the expected values of the background
free energy $F_b$, relative entropy $H$ and the threshold
to nonspecific binding $E_{ns}$ to the known values of these parameters
for Mnt, Cro, the λ-repressor cI, and the lac-repressor
LacR. The units of these values are given by the bracket in the
first column. A dash indicates that the value is not available.
† see Ref. H



sequences is negligible at the TF concentration $n$. Since
there is a large number (i.e., $N_0$) of binding sequences which
contribute significantly to the net effect of background
binding, multiple occupancy of these sequences is indeed
not likely if $n < N_0$. Thus for $N_0 \sim O(10^3)$, the expression
(14) can be taken as a good approximation of the
chemical potential over the typical range of cellular TF
concentration $n = 1 \ldots 10^3$, as shown in Fig. ||(b) for the
3 known TF's. We expect this result to hold even if there
are multiple target sequences, say $m_t$, whose binding energy
$E_t$ is much lower than $E_0$, as long as $E_t > k_BT \ln m_t$
such that $F_b$ is not affected by the addition of these target
sequences to the density of states. Having $\mu(n)$ independent
of the number of targets is a desirable functional
robustness property from a system perspective, since one
would not want to perturb the recognition of the TF's and
the existing targets by the addition of a few new targets.
It will be interesting to see to what extent this feature
is preserved by studying the energetics of TF's with a
large number of target sites, e.g., the catabolic repressor
protein CRP in E. coli |j.

Finally, we compare the values of the optimal interac- 
tion parameters according to our theory to those of the 
well-studied TF's. From the values listed in Table I, we
see that all of the available data are in the neighborhood of
the expectation based on the maximal programmability 
criterion. We do not suggest here that programmabil- 
ity was necessarily the selective driving force that con- 
strained the TF-DNA interaction to its observed form 
(there could be other reasons, e.g. biochemical restric- 
tions, for the interaction to be of this form). However, 
the rough correspondence between theory and observa- 
tion does indicate that it is possible (and perhaps even 
very likely) that TF's generally have the required ener- 
getics for their binding threshold to be programmable 
over a wide range. 

One obvious shortcoming of the above comparison is
that the three TF's for which the interaction parame- 
ters are known are all from bacteriophages and may not 
represent typical prokaryotic TF's. It will therefore be 
very important to experimentally determine the interac- 
tion parameters for a variety of different TF's. The re- 
sults of a sufficient number of such studies will inform us 
whether programmability is a generic feature of TF-DNA 
interaction. Knowledge of this kind can be very helpful
in developing appropriate coarse-grained models of gene
regulation at the system level. In particular, quantitative 
relations of the type suggested by Eq. ( pj| ) will be nec- 
essary for an eventual quantitative description of gene 
regulatory networks. Also, this knowledge would have 
important implications for the evolution of gene regula- 
tion 